Abstract

Functional data analysis is a growing research field, as more and more
practical applications involve functional data. In this paper, we focus
on the problem of regression and classification with functional predictors:
the suggested model combines an efficient dimension reduction procedure
(functional SIR, first introduced by Ferré & Yao (2003)), for which we
give a regularized version, with the accuracy of a neural network. Some
consistency results are given and the method is successfully tested on
real-life data.

Keywords: classification, dimension reduction, functional data analysis, 
multi-layer perceptron, prediction. 



1 Introduction 



Functional regression is now a very important part of statistics as functional
variables occur frequently in practical applications. We present two examples
that take place in functional data analysis (FDA). First, a regression problem
where the regressors are curves is introduced (see Figure 1): the Tecator data
problem (available at http://lib.stat.cmu.edu/datasets/tecator) consists
in predicting the fat content of pieces of meat from a near infrared absorbance
spectrum. This data set first appears in Borggaard & Thodberg (1992) and
has already been studied, among others, in Thodberg (1996), Ferré & Yao (2003)
(with an inverse regression approach) and Ferraty & Vieu (2003).



[Figure 1 about here.] 

Secondly, in the phoneme data set, the data are log-periodograms of
a 32 ms duration corresponding to recorded speakers and we expect to
determine which one of the five phonemes, [sh] as in "she", [dcl] as in
"dark", [iy] as in "she", [aa] as in "dark" and [ao] as in "water",
corresponds to each recording (extracted from the TIMIT database and available at
http://www-stat.stanford.edu/~tibs/ElemStatLearn/data.html). It has
already been described by Hastie et al. (1995) and by Ferraty & Vieu (2003).
Clearly, functional data are also involved here, but we now face a classification
problem. However, we will see that both regression and classification can be
tackled via a common modelling.

An extensive review of the numerous studies developed for functional data
analysis can be found in Ramsay & Silverman (1997), including regression and
classification but also many factorial methods. A particularity of functional
regression is that it often leads to ill-posed problems because of the infinite
dimension of the feature space. Original solutions have been introduced to
overcome this problem: for example, Cardot et al. (1999) studied functional
linear regression. At the same time, Dauxois et al. (2001) and then
Ferré & Yao (2003) and Ferré & Yao (2005) have proposed a semi-parametric model
for Hilbertian variables which corresponds to the functional version of Li's
Sliced Inverse Regression (Li (1991)).

From a classification point of view, many solutions have been proposed to
overcome ill-posed functional problems, including the popular penalization methods.
Friedman (1989) presents the RDA model based on regularization and shrinkage
while Hastie et al. (1994) and Hastie et al. (1995) propose a discriminant
analysis penalized by smoothing functionals. Regularization has also been used for
Canonical Correlation Analysis in Leurgans et al. (1993), and other examples of
the use of regularization are given in Ramsay & Silverman (1997).

Nonlinear methods for functional data analysis have also been developed:
for instance, neural network models (Rossi & Conan-Guez (2005) for multilayer
perceptrons and Rossi et al. (2004) for the SOM algorithm), k-nearest neighbour
models (Biau et al. (2005)) or nonparametric discrimination (Ferraty & Vieu
(2003)).

In this paper, we propose a new way to achieve functional regression: the
idea is to join the efficiency of a dimension reduction method using smoothing
penalization to the strong adaptability of a neural network, which can provide
highly nonlinear solutions even if the number of predictors is too large for
classical nonparametric methods such as kernel smoothing. The functional SIR
dimension reduction method is first presented in Section 2. For the penalized
version, consistency results are given in Section 3. Section 4 discusses neural
networks and gives consistency results for the proposed model combining FSIR
and neural networks (which will be called SIR-NNr). Section 5 is devoted to
applications: Section 5.1 deals with the Tecator data set and Section 5.2 with
the phoneme data set. In the Appendix, we give a sketch of the proofs. All programs
have been written in Matlab and are available on request.



2 Sliced Inverse Regression 

Let $Y$ be a real random variable and $X$ be a multivariate variable assumed to
have a fourth moment. To overcome the curse of dimensionality in the
nonparametric regression of $Y$ on $X$, Li (1991) introduced the Sliced Inverse
Regression. He considers the following model

$$Y = f(a_1^T X, a_2^T X, \ldots, a_q^T X, \epsilon),$$

where $\epsilon$ is centered and independent of $X$, $f$ is an unknown function and
$(a_j)_{j=1,\ldots,q}$ are linearly independent vectors.

The space spanned by $(a_j)_{j=1,\ldots,q}$ is called the EDR (Effective Dimension
Reduction) space. The aim of sliced inverse regression is to estimate this EDR
space by means of the eigenvectors of the matrix
$\mathrm{Var}(X)^{-1}\mathrm{Var}(E(X|Y))$.

In the multivariate context, numerous works deal with SIR. In particular,
methods have been proposed to improve SIR: different estimates of the
covariance of the conditional mean have been built (in Hsing & Carroll (1992)
and Zhu & Fang (1996)) while other methods have been proposed to estimate
the EDR space (for example, PHD proposed by Li (1992), SAVE by
Cook & Weisberg (1991) or MAVE by Xia et al. (2002)). The main interest
of this model is that, once the EDR space is estimated, the estimation of $f$ is
obtained very easily with traditional techniques provided that $q$ is not too large.
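
As an illustration of the multivariate procedure just described, the following is a minimal NumPy sketch, not the authors' Matlab code. The function name `sir_directions`, the equal-count slicing of the sorted responses and the default number of slices are assumptions made for the example.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Multivariate SIR sketch: leading eigenvectors of Var(X)^{-1} Var(E(X|Y)),
    with Var(E(X|Y)) estimated from slice means (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov_x = Xc.T @ Xc / n
    # Slice the responses into groups of (nearly) equal size.
    order = np.argsort(y)
    cov_slice = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)
        cov_slice += (len(idx) / n) * np.outer(m, m)
    # Eigen-decomposition of cov_x^{-1} cov_slice; eigenvalues are real in theory,
    # so any tiny imaginary part is numerical noise and is dropped.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(cov_x, cov_slice))
    keep = np.argsort(eigvals.real)[::-1][:n_dirs]
    dirs = eigvecs[:, keep].real
    return dirs / np.linalg.norm(dirs, axis=0)
```

On simulated data where $Y$ depends on $X$ through a single direction, the first returned column is expected to be close to that direction up to sign.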

2.1 Functional SIR 

Now consider a real random variable $Y$ and $X$ a random variable taking its
values in $L^2_T$, the space of square integrable functions from a compact interval
$T$ into $\mathbb{R}$. With the usual inner product defined by, for all $f, g$ in
$L^2_T$, $\langle f, g \rangle = \int_T f(t)g(t)\,dt$, $L^2_T$ is a Hilbert space.
We will assume that the random variable $X$ is centered, without loss of
generality, and has a fourth moment. Then, the covariance operator of $X$ exists
and is defined by $\Gamma_X = E(X \otimes X)$ where $X \otimes X$ denotes the
operator which associates to any $f$ in $L^2_T$ the function
$\langle f, X \rangle X$. We also get that $E(X|Y)$ and
$\Gamma_{E(X|Y)} = \mathrm{Var}(E(X|Y))$ exist. Ferré and Yao (2003) have
proposed to investigate the following model for functional inverse regression:

$$Y = f(\langle X, a_1 \rangle, \ldots, \langle X, a_q \rangle, \epsilon) \qquad (1)$$

where $f$ is an unknown function, $\epsilon$ a random variable which is centered and
independent of $X$ and $(a_j)_{j=1,\ldots,q}$ are linearly independent functions of
$L^2_T$.

The crucial point of functional SIR is that, unlike in the multivariate case,
$\Gamma_X^{-1}$ is not defined on the whole space: $\Gamma_X$ is assumed to be a
positive definite operator but, in infinite dimension, it is not invertible as
an operator from $L^2_T$ to $L^2_T$. However, if we call
$(\delta_i)_{i=1,\ldots,\infty}$ its sequence of eigenvalues and
$(u_i)_{i=1,\ldots,\infty}$ the associated orthonormal eigenvectors, $R_\Gamma$ the
image of $\Gamma_X$ and
$R_\Gamma^{-1} = \{h \in L^2_T : \exists f \in R_\Gamma,\ h = \sum_i (1/\delta_i)\langle u_i, f \rangle u_i\}$,
then $\Gamma_X$ is a one-to-one mapping from $R_\Gamma^{-1}$ to $R_\Gamma$ whose
inverse, called $\Gamma_X^{-1}$, is defined by
$\Gamma_X^{-1} = \sum_i (1/\delta_i)\, u_i \otimes u_i$.
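
A short worked computation, added here as an illustration, shows why this inverse cannot be bounded. It uses only the eigen-decomposition above together with the standard fact that, for a centered $X$ with finite second moment, the eigenvalues of the covariance operator are summable.

```latex
\Gamma_X u_i = \delta_i u_i
\quad\Longrightarrow\quad
\Gamma_X^{-1} u_i = \frac{1}{\delta_i}\, u_i ,
\qquad
\|\Gamma_X^{-1} u_i\| = \frac{1}{\delta_i} \xrightarrow[i \to \infty]{} +\infty
\quad\text{since}\quad
\sum_{i} \delta_i = E\|X\|^2 < +\infty .
```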

We focus on the estimation of the EDR space spanned by
the vectors $(a_j)_{j=1,\ldots,q}$. The key of the method comes from the following
theorem:



Theorem 1 (Ferré & Yao (2003)). Writing
$A = (\langle X, a_1 \rangle, \ldots, \langle X, a_q \rangle)^T$, if

(A1) for all $u$ in $L^2_T$, there exists $v$ in $\mathbb{R}^q$ such that
$E(\langle u, X \rangle | A) = v^T A$,

then $E(X|Y)$ belongs to the subspace spanned by $\Gamma_X a_1, \ldots, \Gamma_X a_q$.

Remark 1. Note that Cook & Weisberg (1991) show that elliptically distributed
variables satisfy condition (A1) in the multidimensional context, and this can
be transposed to infinite dimensional Hilbert spaces (see Yao (2001)).



By using the result of Dauxois et al. (2001), a consequence of Theorem 1
is that the EDR subspace contains the $\Gamma_X$-orthonormal eigenvectors
of $\Gamma_X^{-1}\Gamma_{E(X|Y)}$ associated with the $q$ positive eigenvalues.
In the following, $(a_i)_{i=1,\ldots,q}$ will denote those eigenvectors. This is
the generalization of Li (1991) on SIR to the infinite dimensional case.

A basis of the EDR space is thus given by the eigenvectors of
$\Gamma_X^{-1}\Gamma_{E(X|Y)}$ but, to ensure that these eigenvectors exist in
$L^2_T$, we have to assume that (see Ferré & Yao (2005) for details)
$\sum_i \sum_j \frac{1}{\delta_i \delta_j} E\big(E(\xi_i|Y)E(\xi_j|Y)\big)^2 < +\infty$,
where $X = \sum_i \xi_i u_i$ is the Karhunen-Loève decomposition of $X$.

Let $\{(X^n, Y^n)\}_{n=1,\ldots,N}$ be an i.i.d. sample. In order to estimate the EDR
space, we have to choose an estimate for $\Gamma_{E(X|Y)}$. We propose a slicing
approach: in Ferré & Yao (2003), the estimate is obtained by partitioning the
domain of $Y$ into $(I_h)_{h=1,\ldots,H}$ and by setting

$$\Gamma^N_{E(X|Y)} = \sum_{h=1}^{H} \frac{N_h}{N}\, \widehat{E}(X|Y \in I_h) \otimes \widehat{E}(X|Y \in I_h) - \bar{X} \otimes \bar{X},$$

where, if $I$ is the indicator function, $N_h = \sum_{n=1}^{N} I_{\{Y^n \in I_h\}}$,
$\widehat{E}(X|Y \in I_h) = (1/N_h)\sum_{n=1}^{N} X^n I_{\{Y^n \in I_h\}}$ and
$\bar{X}$ is the empirical mean. Another approach, based on a kernel estimate,
has been developed in Ferré & Yao (2005). Although this could be used in our
context, we focus on a slicing approach for the sake of simplicity.

A usual estimate of $\Gamma_X$ is
$\Gamma_X^N = (1/N)\sum_{n=1}^{N} X^n \otimes X^n - \bar{X} \otimes \bar{X}$, but this
estimate is ill-conditioned (because $\Gamma_X^{-1}$ is not a bounded operator) so
the eigenvectors of $(\Gamma_X^N)^{-1}\Gamma^N_{E(X|Y)}$ do not converge to the
eigenvectors of $\Gamma_X^{-1}\Gamma_{E(X|Y)}$. That is the reason why penalization
or regularization is needed.
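
The ill-conditioning can be seen numerically on discretized curves. The sketch below is an illustration on simulated data, not on the data sets of this paper: the curves are hypothetical random combinations of five sine functions, observed at more points than there are curves.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 100                      # fewer curves than discretization points
t = np.linspace(0.0, 1.0, D)
# Hypothetical smooth curves: random combinations of a few sine functions.
basis = np.array([np.sin((k + 1) * np.pi * t) for k in range(5)])
X = rng.normal(size=(N, 5)) @ basis
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N                 # discretized empirical covariance (D x D)
eigvals = np.linalg.eigvalsh(cov)   # eigenvalues in ascending order
rank = np.linalg.matrix_rank(cov)
print("rank:", rank, "out of", D)
print("smallest / largest eigenvalue:", eigvals[0], eigvals[-1])
```

The covariance matrix has at most five non-zero eigenvalues here, so it cannot be inverted directly.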

Ferré & Yao (2003) suggest proceeding like Bosq (1991) by considering,
instead of $\Gamma_X$, a sequence of finite rank operators with bounded inverses
and converging to $\Gamma_X$. This leads to the estimates
$(a_j^N)_{j=1,\ldots,q}$ of $(a_j)_{j=1,\ldots,q}$ that, under some conditions,
satisfy $\| a_j^N - a_j \| \to_P 0$.

The authors also suggest a way of estimating the EDR space for functional
data without inverting the covariance operator of the regressor (Ferré & Yao
(2005)).

We propose, in Section 3, a regularized approach by penalization.



2.2 SIR for classification 

Let $C_1, \ldots, C_H$ be $H$ groups. When $Y$ is multidimensional, the results of
Dauxois et al. (2001) are still available and, by setting
$Y = (I_{C_1}, \ldots, I_{C_H})$, where $I_{C_h}$ is the indicator function of the
$h$-th group, Model (1) remains valid and we get a natural way to include
classification problems into FSIR, see Ferré & Villa (2005). Note that, in the
functional case, multivariate methods for discrimination have been extended,
mainly inspired by Linear Discriminant Analysis (LDA). In this area, let us
mention the works of Hastie et al. (1994), Hastie et al. (1995) and
James & Sugar (2003).

Now, by estimating $\Gamma_{E(X|Y)}$ by

$$\Gamma^N_{E(X|Y)} = \sum_{h=1}^{H} \frac{N_h}{N}\, \widehat{E}(X|Y = h) \otimes \widehat{E}(X|Y = h) - \bar{X} \otimes \bar{X}$$

where $N_h = \sum_{n=1}^{N} I_{\{Y^n = h\}}$ and
$\widehat{E}(X|Y = h) = (1/N_h)\sum_{n=1}^{N} X^n I_{\{Y^n = h\}}$, FSIR
leads to a discriminant analysis. The estimation of the EDR space is identical to
the discriminant space in linear discriminant analysis. However, the estimation
of $f$ leads to a natural classification rule. Indeed, since we have, for all $x$,
$f(x) = E(Y|X = x) = (P(C_1|X = x), \ldots, P(C_H|X = x))$, the estimation of $f$
coincides with the estimation of the probabilities of the groups conditionally on
$X$.
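
For curves observed on a common grid, the class-based estimate above reduces to a matrix computation. The following sketch is one possible NumPy transcription; the function name `between_class_covariance` and the row-per-curve layout are assumptions of the example.

```python
import numpy as np

def between_class_covariance(X, labels):
    """Sum over classes of (N_h / N) * m_h m_h' minus the outer product of the
    global mean, where m_h is the mean curve of class h (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    xbar = X.mean(axis=0)
    gamma = -np.outer(xbar, xbar)
    for h in np.unique(labels):
        members = X[labels == h]
        m = members.mean(axis=0)
        gamma += (members.shape[0] / n) * np.outer(m, m)
    return gamma
```

Because the weighted class means average to the global mean, the resulting matrix has rank at most $H - 1$, which bounds the dimension of the EDR space that can be estimated in a classification problem.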



3 Regularized functional SIR 

In Section 2, we saw that the EDR space contains the eigenvectors of the operator
$\Gamma_X^{-1}\Gamma_{E(X|Y)}$. Thus, as is the case for Discriminant Analysis, the
estimator of the first direction of the EDR space can be found by maximizing a
Rayleigh criterion:
$\max_a \langle \Gamma^N_{E(X|Y)} a, a \rangle / \langle \Gamma^N_X a, a \rangle$.
Unfortunately, as $\Gamma^N_X$ is ill-conditioned, the maximization of the
empirical Rayleigh expression does not lead to a good estimate of the EDR space:
that is the reason why a regularization is needed.

Provided that we have smooth functions, a relevant method for functional
data is to penalize the covariance operator in the Rayleigh expression by
introducing smoothing constraints on the estimated functions. This method has
already proved very effective (see Hastie et al. (1995) for an example of
penalized discriminant analysis).

3.1 Main result 

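Let $S$ be the subspace of $L^2_T$ of functions with a square integrable second
derivative. We introduce a penalty through a bilinear form defined on $S \times S$
by, for all $f, g$ in $S$, $[f, g] = \int_T D^2 f(t)\, D^2 g(t)\,dt$. We also define
the penalized bilinear forms associated with the operators $\Gamma^N_X$ and
$\Gamma_X$:

$$Q^N_\alpha(f, g) = \langle \Gamma^N_X f, g \rangle + \alpha [f, g] \quad\text{and}\quad Q_\alpha(f, g) = \langle \Gamma_X f, g \rangle + \alpha [f, g]$$

where $\alpha$ is a regularization parameter. The solutions of the regularized
functional inverse regression are given by maximizing, under orthogonality
constraints, the function

$$\gamma^N(a) = \frac{\langle \Gamma^N_{E(X|Y)} a, a \rangle}{\langle \Gamma^N_X a, a \rangle + \alpha [a, a]}.$$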

In order to obtain consistency results for the estimates of
$(a_j)_{j=1,\ldots,q}$, we make the following assumptions:

(A2) $E(\| X \|^4) < +\infty$;

(A3) for all $\alpha > 0$, $\inf_{\|a\| = 1,\ a \in S} Q_\alpha(a, a) = \rho_\alpha > 0$;

(A4) $\Gamma^N_{E(X|Y)}$ is a continuous operator which converges in probability
to $\Gamma_{E(X|Y)}$ with $\sqrt{N}$ rate;

(A5) $\lim_{N \to +\infty} \alpha = 0$, $\lim_{N \to +\infty} \sqrt{N}\alpha = +\infty$;

(A6) $(a_j)_{j=1,\ldots,q}$ belong to $S$ and verify, for all $u$ such that
$\langle \Gamma_X u, a_1 \rangle = 0$ and $\langle \Gamma_X u, u \rangle = 1$,
$\langle \Gamma_{E(X|Y)} u, u \rangle \le \langle \Gamma_{E(X|Y)} a_2, a_2 \rangle = \lambda_2 < \lambda_1$.

Since $S$ is not a closed subset, $\gamma^N$ may fail to reach a maximum on $S$.
However, the following result holds:

Theorem 2. Under assumptions (A1)-(A6), with probability converging to 1,
the function $\gamma^N$ reaches its maximum on $S$ when $N$ grows to $+\infty$.
In this case, let $a_1^N$ be a vector of $S$ for which $\gamma^N$ is maximum and
which is such that $\langle \Gamma_X a_1^N, a_1 \rangle = 1$. Then,

$$\langle \Gamma_X (a_1^N - a_1), a_1^N - a_1 \rangle \to_P 0,$$

when $N$ tends to $+\infty$.

Remark 2. For ease of presentation, we introduce a particular type
of penalization, but similar results can be obtained for other regularization
functionals satisfying the assumptions. For example, we can replace the bilinear
form $[., .]$ by another one which is similar to the one used in Ridge-PDA
(Hastie et al. (1995)).



Remark 3. Assumptions (A2), (A3) and (A5) are technical assumptions that
ensure the existence and convergence of $(a_j^N)_{j=1,\ldots,q}$: (A2) implies that
$\Gamma^N_X$ will converge to $\Gamma_X$ at the $\sqrt{N}$ rate; we can find in
Leurgans et al. (1993) conditions that imply (A3). This assumption shows the
purpose of regularization: it controls the scaling of $Q_\alpha$ and, thanks to
(A5), ensures that the denominator of $\gamma^N$ does not go to 0 too fast.
Finally, (A5) gives a way of choosing the regularization parameter $\alpha$ (for
practical aspects see Section 3.2).

Remark 4. When working with a compact operator $\Gamma$, the ridge regularization
$\Gamma + \alpha I$ (where $I$ denotes the identity operator) always leads to
$\inf_{\|a\| = 1} \langle (\Gamma + \alpha I)a, a \rangle = \rho_\alpha > 0$, which
is exactly assumption (A3). Here, the regularization applied to $\Gamma_X$ is not
the ridge one but is more adapted to the smoothness of the data; an intuitive
meaning of this is the ridge regularization of a $D^2 \Gamma_X D^{-2}$ type
operator (see also Section 3.2 for a consequence of this penalization and
the link with assumption (A3)).
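
The ridge bound mentioned in Remark 4 can be checked in one line. The computation below is an added illustration and uses the extra assumption, not stated explicitly above, that $\Gamma$ is non-negative, as a covariance operator is.

```latex
\langle (\Gamma + \alpha I)a, a \rangle
= \langle \Gamma a, a \rangle + \alpha \|a\|^2
\;\ge\; \alpha
\qquad \text{for all } a \text{ with } \|a\| = 1,
\quad\text{so that}\quad \rho_\alpha \ge \alpha > 0 .
```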

Remark 5. Assumption (A4) is fulfilled by the usual estimates introduced
above: Li (1991) emphasized the fact that the sliced estimate is consistent,
with rate $\sqrt{N}$, for the variable $(I_{\{Y \in I_h\}})_{h=1,\ldots,H}$, which
satisfies assumption (A1) as $Y$ does. Ferré & Yao (2005) proved the consistency
of the Nadaraya-Watson estimate of $\Gamma_{E(X|Y)}$, and the Hilbertian Central
Limit Theorem ensures the consistency of the estimate given for the
classification case.



3.2 Practical aspects 

From a practical point of view, $X$ has been observed at some points
$t_1, t_2, \ldots, t_D$ (for ease of presentation, we suppose that these
observations have been centered). The optimization of the penalized Rayleigh
expression described in Section 3.1 can be performed by using, for example,
B-splines $(B_i)_i$ to parametrize $a_j^N$:

$$a_j^N(t) = \sum_i A_j^i B_i(t) = A_j' B$$

where $B$ is the matrix containing the values of $(B_i(t))_i$ at the points
$t_1, t_2, \ldots, t_D$. Similarly, the matrix of observations
$X = (X^n(t_d))_{n=1,\ldots,N,\ d=1,\ldots,D}$ can be written in the form of
B-splines: $X = CB$ with $C = [C^1 \ldots C^N]'$. Let $B^{(2)}$ be the matrix
containing the values of $D^2 B(t)$.

If we use the slicing estimate of $\Gamma_{E(X|Y)}$ for regression, we introduce,
for all $h = 1, \ldots, H$,
$Y_h = [I_{\{Y^1 \in I_h\}}, \ldots, I_{\{Y^N \in I_h\}}]'$. Then, the problem of
maximizing $\gamma^N$ is equivalent to maximizing
$(A' M_E A)/(A' M_{X,\alpha} A)$ where $M_E$ is the estimator of
$\Gamma_{E(X|Y)}$ obtained by the slicing approach,
$M_E = \sum_{h=1}^{H} (N_h/N)\, BB'C'Y_h Y_h' CBB'$, and where
$M_{X,\alpha} = (1/N)\, BB'C'CBB' + \alpha B^{(2)\prime} B^{(2)}$. This expression
underlines the role of the penalization: the matrix $(1/N)\, BB'C'CBB'$ is usually
ill-conditioned (because of the high dimension of the data) and has tiny
eigenvalues (that can even be equal to 0). Provided that
$B^{(2)\prime} B^{(2)}$ is invertible, the eigenvalues are rescaled in a basis
depending on $B^{(2)}$ and are bounded below by a strictly positive number
depending on $\alpha$: assumption (A3) is then fulfilled in practice.

The first solution is the eigenvector, with $M_{X,\alpha}$-norm equal to 1,
associated with the largest eigenvalue of the matrix $M_{X,\alpha}^{-1} M_E$. By
pursuing the procedure under orthogonality constraints, we get that the other
solutions are the $M_{X,\alpha}$-orthonormal eigenvectors of
$M_{X,\alpha}^{-1} M_E$.

If we deal with classification, the same procedure is achieved by letting
$Y_h = [I_{\{Y^1 = h\}}, \ldots, I_{\{Y^N = h\}}]'$.
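
The generalized eigenproblem above can be sketched numerically as follows. This is an illustration under simplifying assumptions, not the authors' implementation: curves are handled directly on a regular grid, the B-spline penalty is replaced by a finite-difference approximation of the second derivative, and the names `second_difference_penalty` and `penalized_sir` are hypothetical. The Cholesky step requires the penalized matrix to be positive definite.

```python
import numpy as np

def second_difference_penalty(n_points, step):
    """Matrix P such that a' P a approximates the integral of (D^2 a)^2
    for a function sampled on a regular grid (finite-difference assumption)."""
    d2 = np.diff(np.eye(n_points), n=2, axis=0) / step**2
    return step * d2.T @ d2

def penalized_sir(X, slice_ids, alpha, n_dirs, step=1.0):
    """Leading solutions of M_E a = lambda * M_X a, with
    M_X = empirical covariance + alpha * penalty."""
    X = np.asarray(X, dtype=float)
    slice_ids = np.asarray(slice_ids)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    m_x = Xc.T @ Xc / n + alpha * second_difference_penalty(d, step)
    m_e = np.zeros((d, d))
    for h in np.unique(slice_ids):
        rows = Xc[slice_ids == h]
        mean_h = rows.mean(axis=0)
        m_e += (rows.shape[0] / n) * np.outer(mean_h, mean_h)
    # Whitening by the Cholesky factor of M_X turns the generalized
    # problem into an ordinary symmetric eigenproblem.
    inv_chol = np.linalg.inv(np.linalg.cholesky(m_x))
    eigvals, eigvecs = np.linalg.eigh(inv_chol @ m_e @ inv_chol.T)
    top = np.argsort(eigvals)[::-1][:n_dirs]
    directions = inv_chol.T @ eigvecs[:, top]   # columns are M_X-orthonormal
    return eigvals[top], directions, m_x
```

Here `slice_ids` holds either the slice index of each response (regression) or the class label (classification), which mirrors the two definitions of $Y_h$ given above.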

Finally, we have to find the optimal value for $\alpha$. This can be done, if the
sample is large enough (which is the case in the presented applications), by
dividing it into two parts: we apply the previous procedure on the first part to
find $(a_j^N)_j$ and evaluate the error committed by Model (1) on the second part;
the best parameter is then chosen to minimize this error.
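
The split-sample selection just described can be written generically. In the sketch below, `select_by_holdout`, `fit` and `error` are hypothetical names: `fit` stands for any procedure trained with a given parameter value and `error` for the criterion evaluated on the second part of the sample.

```python
import numpy as np

def select_by_holdout(candidates, fit, error, X, y, train_fraction=0.5, seed=0):
    """Return the candidate value with the smallest error on the held-out part,
    together with the list of held-out errors (one per candidate)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    n_train = int(train_fraction * n)
    train, valid = perm[:n_train], perm[n_train:]
    errors = []
    for value in candidates:
        model = fit(X[train], y[train], value)
        errors.append(error(model, X[valid], y[valid]))
    best = int(np.argmin(errors))
    return candidates[best], errors
```

The same routine can serve for $\alpha$ alone or for a grid of several parameters, provided each candidate is passed as a single object.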
4 Multilayer perceptrons

4.1 Approximation by multilayer perceptrons 

After the EDR space is estimated, the goal is to get an estimation of the function
$f$ in (1): we propose to use a feedforward neural network with one hidden layer.
This method (see, e.g., Bishop (1995) for a review on neural networks) is an
alternative to other nonparametric regressions if the dimension of the EDR
space is too large. It has the advantage of working in all cases, while some
nonparametric methods, such as kernel smoothing or spline smoothing, face
the curse of dimensionality.

The main interest of neural networks is their ability to approximate any
function with the desired precision (universal approximation); see, for
instance, Hornik (1993) for the multivariate context and Stinchcombe and
Rossi & Conan-Guez (2005) in the infinite dimensional one.



4.2 A consistency result 

Multi-layer perceptron approximations of functionals in infinite dimensional
spaces have been studied in Chen & Chen (1995), Sandberg & Xu and
Rossi & Conan-Guez (2005). Several strategies are available, either by directly
using the curves as inputs of the feedforward neural networks or by first
projecting the data onto a classical functional basis (such as a spline basis, a
Fourier basis, wavelets) or a basis derived from the PCA of $X$. This latter
approach is used by Thodberg (1996).



Our approach is similar but, instead of projecting the data onto a fixed basis
or a principal component basis, we project them onto the EDR space. The EDR
space behaves as an efficient subspace for the regression of $Y$ on $X$ and it is
a way to get a basis which takes into account the relationship between $Y$ and
$X$. In fact, the data are projected onto an estimation of the EDR space, so the
accuracy of the projection, and then the estimation of the optimal weights for
the neural network, also depend on how well the EDR space is estimated.

We construct a perceptron (see Figure 2) with one hidden layer having

• as inputs, the coordinates of the projection of $X$ onto
$\mathrm{Span}\{(a_j)_{j=1,\ldots,q}\}$:
$\langle X, a_1 \rangle, \ldots, \langle X, a_q \rangle$;

• $q_2$ neurons on the hidden layer (where $q_2$ is a parameter to be estimated);

• as outputs, one neuron for regression and $H$ neurons for classification,
representing target $Y$.

[Figure 2 about here.] 

The output of such a neural network is then

$$\sum_{i=1}^{q_2} w_i^{(2)}\, g\Big(\sum_{j=1}^{q} w_{i,j}^{(1)} \langle X, a_j \rangle + w_i^{(0)}\Big),$$

where $g$ is the activation function (for example a sigmoid). The purpose of the
training step is then to find $w^*$ which minimizes a loss function $L$ between
the output of the neural network with weights
$w = \big((w_i^{(2)})_{i=1,\ldots,q_2},\ (w_{i,j}^{(1)})_{i=1,\ldots,q_2,\ j=1,\ldots,q},\ (w_i^{(0)})_{i=1,\ldots,q_2}\big)$
and the target $Y$:

$$w^* = \arg\min_{w} E\Big[ L\Big(\sum_{i=1}^{q_2} w_i^{(2)}\, g\Big(\sum_{j=1}^{q} w_{i,j}^{(1)} \langle X, a_j \rangle + w_i^{(0)}\Big),\ Y\Big)\Big]. \qquad (2)$$

Actually, we obtain an estimation $w_N^*$ of $w^*$ by

$$w_N^* = \arg\min_{w} \sum_{n=1}^{N} L\Big(\sum_{i=1}^{q_2} w_i^{(2)}\, g\Big(\sum_{j=1}^{q} w_{i,j}^{(1)} \langle X^n, a_j^N \rangle + w_i^{(0)}\Big),\ Y^n\Big).$$

White (1989) gives a consistency theorem for the weights of a neural network
estimated from a set of iid observations. Since $(a_j^N)_j$ is an estimation of
the EDR space deduced from the whole data set $\{(X^n, Y^n)\}_n$, the inputs of our
functional perceptron used to determine $w_N^*$ do not satisfy the iid assumption
and a proper consistency result is then needed.
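
The output formula and the empirical criterion can be transcribed as follows. This is a minimal sketch for the regression case with a single output neuron, a sigmoid activation and the squared error; the function names and the array layout (one row of projections $\langle X^n, a_j^N \rangle$ per observation) are assumptions of the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def perceptron_output(u, w2, w1, w0):
    """One-hidden-layer output for each row of u.
    u: (n, q) projections; w1: (q2, q); w0 and w2: (q2,)."""
    hidden = sigmoid(u @ w1.T + w0)     # (n, q2) hidden activations
    return hidden @ w2                  # (n,) network outputs

def empirical_risk(u, y, w2, w1, w0):
    """Sum over the sample of the squared error between output and target."""
    return float(np.sum((perceptron_output(u, w2, w1, w0) - y) ** 2))
```

Training then amounts to minimizing `empirical_risk` over the weights with any numerical optimizer.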

Let us introduce some notations: $\zeta$ is the function defined on $O \times W$
($O$ is an open set of $\mathbb{R}^{q+1}$ and $W$ is a compact set of
$\mathbb{R}^{(q+2)q_2}$) such that, for all $z = (u, y)$ in $O$,
$\zeta(z, w) = L\Big(\sum_{i=1}^{q_2} w_i^{(2)}\, g\Big(\sum_{j=1}^{q} w_{i,j}^{(1)} u_j + w_i^{(0)}\Big),\ y\Big)$;
$Z$ is the couple of random variables $(\{\langle X, a_j \rangle\}_j, Y)$ and
$(Z_n)_{n=1,\ldots,N}$ are observations of $Z$; finally, $(Z_n^N)_{n=1,\ldots,N}$
are the couples $(\{\langle X^n, a_j^N \rangle\}_j, Y^n)$. In our context, the
consistency of the multi-layer perceptron is given by the following theorem:

Theorem 3. Under assumptions (A1)-(A6) and the following assumptions

(A7) for all $z$ in $O$, $\zeta(z, .)$ is continuous;

(A8) there is a measurable function $\tilde{\zeta}$ from $O$ into $\mathbb{R}$
such that, for all $z$ in $O$, for all $w$ in $W$,
$|\zeta(z, w)| \le \tilde{\zeta}(z)$ and $E(\tilde{\zeta}(Z)) < +\infty$;

(A9) for all $w$ in $W$, there exists $C(w) > 0$ such that, for all $(x, y)$
and $(x', y)$ in $O$,
$|\zeta((x, y), w) - \zeta((x', y), w)| \le C(w)\, \| x - x' \|$;

(A10) for all $w$ in $W$, $\zeta(., w)$ is measurable.

If $W^*$ is the set of minimizers of problem (2), then
$d(w_N^*, W^*) \to_P 0$ as $N$ tends to $+\infty$, with $d$ defined by
$d(w, W) = \inf_{\tilde{w} \in W} \| w - \tilde{w} \|$ where $\| . \|$ is the usual
Euclidean distance.

Remark 6. This list of assumptions is, for example, verified by a perceptron
with one hidden layer and a sigmoid function $g(x) = e^x/(1 + e^x)$ on the hidden
layer associated with the square error $L(\psi, y) = \| \psi - y \|^2$, provided
that $Y$ is bounded.

Remark 7. Assumptions (A1)-(A6) ensure the convergence of
$(a_j^N)_{j=1,\ldots,q}$ to $(a_j)_{j=1,\ldots,q}$ but they can be replaced by a list
of assumptions implying the same result. For example, we would have the same
consistency result by projecting the data on the estimated EDR space found by the
functional SIR presented in Ferré & Yao (2003) and Ferré & Yao (2005).



5 Applications 
5.1 Tecator data 

As already said, the Tecator data problem consists in predicting the fat content
of pieces of meat from a near infrared absorbance spectrum. We have $N = 215$
observations of $(X, Y)$ where $X$ is the absorbance spectrum discretized at one
hundred points and $Y$ is the fat content.

In order to carry out the procedure described in Section 3.2, we project the
data onto a cubic spline basis. Because of their smoothness, these data are very
well projected onto a basis with 40 equally spaced knots (actually, when using
40 equally spaced knots, or more, the interpolation of the observations by the
spline basis is exact); then, for simplicity, we used this projection for
the computation when needed and used the original data in the other cases. We
tried several classical methods in order to test the efficiency of SIR-NNr. The
competitors are:

• SIR-NNr: the functional SIR regularized by penalization, presented in
Section 3, precedes a neural network. The neural network training step uses
an early stopping procedure: the learning sample is divided into 3
samples (training / validation / test); the training sample is used to train
the neural network, the validation sample is used for early stopping
(when the validation error increases, training is stopped) and this training
step is performed 10 times. The weights giving the best performance on the
test sample are retained;

• SIR-NNk: here we use the smoothed functional inverse regression
method presented in Ferré & Yao (2003) as pre-processing to a neural
network; the purpose is to show the benefit of the regularization. The
neural network is also trained by early stopping;

• PCA-NN: in order to show the advantage of SIR, we compute a principal
component analysis (as in Thodberg (1996)) before a neural network
procedure is used (a classical neural network, while Thodberg uses a
sophisticated Bayesian neural network);

• NNf: this method is the functional neural network (the spline projections
are used to represent the functional weights and inputs) described by
Rossi & Conan-Guez (2005). In that paper, the B-spline basis projection is
selected by cross-validation, which leads to a huge computational time: we
do not follow this approach and use the cubic basis with 40 knots;

• SIR-L: after projecting the data onto the EDR space determined by
regularized SIR, we compute a linear regression in order to show the efficiency
of a neural network compared to a classical parametric method.
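
The early stopping rule used for the neural network competitors (stop as soon as the validation error increases) can be sketched as below. Plain gradient descent stands in for the actual training algorithm, which is not specified here, and `train_with_early_stopping`, `grad` and `val_error` are hypothetical names.

```python
import numpy as np

def train_with_early_stopping(params, grad, val_error, lr=0.1, max_epochs=1000):
    """Gradient descent on the training criterion that stops as soon as the
    validation error increases; returns the last parameters before the increase."""
    best_params, best_err = params.copy(), val_error(params)
    for _ in range(max_epochs):
        params = params - lr * grad(params)   # one step on the training sample
        err = val_error(params)               # monitored on the validation sample
        if err > best_err:                    # validation error went up: stop
            break
        best_params, best_err = params.copy(), err
    return best_params, best_err
```

Repeating this from several random initial weights and keeping the run with the smallest test error corresponds to the ten repetitions described for SIR-NNr.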

Note also that some classical nonparametric methods, such as
kernel estimates which depend on the Euclidean norm, cannot be used for this
data set as the dimensionality of the EDR space is too large compared with the
number of data (the value of $q$ is given in Table 1).

Before comparing the different methods, and in order to limit computational
time, we determined the best parameters for each one. Our sample is divided
into two parts: on the first one, we determine the values of $(a_j^N)_j$ and of
the weights of the neural network for various values of $\alpha$, $q$ and $q_2$.
On the second part, we determine the standard error of prediction (SEP): the
"best" parameters are those which minimize this SEP (see Table 1).

[Table 1 about here.] 

Then, in order to see not only the error made by each method but also
its variability, we randomly build 50 samples divided as follows: the learning
sample contains 172 observations and the test sample contains 43. All five
methods are first trained on the learning sample (with their optimal parameters
pre-determined as described above) and the standard error of prediction (SEP)
is then computed on the test sample.
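
The resampling scheme can be sketched as follows. The SEP formula is not given in the text; the root mean squared prediction error used below is one common definition and is an assumption of this example, as are the function names.

```python
import numpy as np

def standard_error_of_prediction(y_true, y_pred):
    """Root mean squared prediction error (assumed definition of SEP)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def random_splits(n_obs, n_train, n_repeats, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random permutations."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_obs)
        yield perm[:n_train], perm[n_train:]

# Split sizes used for the Tecator data: 215 observations, 172 for learning.
splits = list(random_splits(215, 172, 50))
```

Each method is then fitted on `train_idx` and scored with `standard_error_of_prediction` on `test_idx`, which gives 50 test errors per method.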

Figure [3] gives the boxplot of the test errors for the 50 samples. 

[Figure 3 about here.] 

These results show the excellent performance obtained by SIR-NNr: its
average SEP over the 50 samples is about twice as low as that of any of the other
competitors. Moreover, this method guarantees good stability, unlike the others.
SIR seems to be a very good pre-processing stage, as SIR-NNk also performs well.
NNf comes next, but its rather good results come at the price of a very
slow computation. To show this, we give the computational time of each
method: when SIR-NNr takes 100 seconds per sample, NNf takes 350 and SIR-L
only 1. Clearly NNf is very expensive while SIR-L is very fast but works poorly.
The computational time is in fact closely related to the number of inputs: 42
for NNf and 20 for SIR-NNr.



5.2 Phoneme data 



In this section, we compare our methodology with other approaches on a
classification problem, namely the phoneme data. The data are log-periodograms
of a 32 ms duration corresponding to recorded speakers; the problem deals with the
discrimination of five speech frames corresponding to five phonemes transcribed as
follows: [sh] as in "she", [dcl] as in "dark", [iy] as in "she", [aa] as in "dark"
and [ao] as in "water". Finally, the data consist of 4 509 log-periodograms of
length 256 (see Figure 4).

[Figure 4 about here.] 

We tried several classical methods in order to test the efficiency of SIR-NNr
which is compared with:

• SIR-NNp: a classical SIR as presented in Ferré & Yao (2003) as
preprocessing of a neural network;

• SIR-K: a regularized functional SIR where the function $f$ is estimated by
a nonparametric kernel method;

• Ridge-PDA: the penalized discriminant analysis introduced in
Hastie et al. (1995) which uses a ridge penalty;

• NPCD-PCA: a nonparametric method using kernels and semi-metrics
based on Principal Component Analysis and introduced by Ferraty & Vieu
(2003).

The optimal parameters for these methods, chosen as in the previous
example, are shown in Table 2.

[Table 2 about here.] 

For the SIR stage, the optimal dimension of the EDR space is set to 4: it is the
maximum dimension possible as the operator $\Gamma^N_{E(X|Y)}$ has rank $H - 1$.
We can also see that this dimension is relevant by looking at the projection of
the data onto the EDR space (for SIR-NNr, for example, see Figure 5): only the
fourth axis is able to separate the phonemes [aa] and [ao].

[Figure 5 about here.] 

Then we randomly build 50 samples divided as follows: the learning sample
contains 1 735 log-periodograms (347 for each class) and the test sample also
contains 1 735 (347 for each class). All five methods are first trained on the
learning sample and the test error rate is then computed on the test sample.
Figure 6 shows the boxplot of the test error rates.

[Figure 6 about here.] 

The results of SIR-NNr, SIR-NNp and SIR-K are very close. The benefit of
SIR is highlighted since those three methods work better than the others, which
are based on different projections of the data. The advantage of regularization
is also revealed since it leads again to the best results. Then comes Ridge-PDA
and finally NPCD-PCA, which gives the poorest performance. On the other hand, due
to the low dimensionality, neural networks seem to perform less well than kernels
and to have a larger variability (standard deviation is 0.56 for SIR-NNr and only
0.40 for SIR-K): this problem can be reduced by increasing the number of training
steps, by using a more sophisticated architecture or a regularization technique
(such as weight decay), but at the price of a larger computational cost. Finally,
although SIR-K obtains the best mean (8.09 % versus 8.21 % for SIR-NNr), SIR-NNr
is the method which reaches the best minimum, which shows its great potential.

In conclusion, on both regression and classification problems, regularized
SIR-NN is a competitive solution for functional problems: we can explain these
good results by noting that the procedure combines an efficient dimension
reduction model and the great accuracy of a neural network, which is able to
approximate almost every function. Thus this model can be efficient both for
ill-posed problems, thanks to the penalized functional, and for problems with a
large dimensionality, thanks to the neural network step. Finally, it has another
great advantage: computational time is rather short and does not increase too
much with the number of observation points for the curves.

A Appendix

Here we give the main lines of the proofs of Theorems 2 and 3.

A.1 Theorem 2



The p roof of this theorem is related to the one of Theorem 1 in lLeurgans et al. 
1 19931 ) and only sketches are given. 



Lemma 1: Using the Central Limit Theorem, it is easy to show that if 
$\delta_N = \max\{|||\Gamma_X^N - \Gamma_X|||,\ |||\Gamma_{E(X|Y)}^N - \Gamma_{E(X|Y)}|||\}$ 
and if the sequence $(k_N)_N$ satisfies $\sqrt{N}/k_N \to +\infty$, then 
$k_N \delta_N \xrightarrow{p} 0$. 

Existence: For $\alpha$ in $[0, 1]$, we have 
$Q_\alpha = (1 - \alpha)\langle \Gamma_X \cdot, \cdot \rangle + \alpha Q_1$ and then, 
for all $u$ such that $\|u\| = 1$, 
$(1/\alpha) Q_\alpha(u, u) \geq (1/\alpha - 1)\langle \Gamma_X u, u \rangle + Q_1(u, u) \geq \rho_1$ 
by the positiveness of $\Gamma_X$. Then $\sqrt{N}\rho_\alpha \geq \alpha\sqrt{N}\rho_1$ 
and we have 

$$\sqrt{N}\rho_\alpha \to +\infty. \qquad (3)$$

Then, by Lemma 1, noting $\Delta^N = \Gamma_X^N - \Gamma_X$, 
$\lim_{N \to +\infty} P(\{\omega \in \Omega : |||\Delta^N||| < (1/2)\rho_\alpha\}) = 1$ 
(where $\Omega$ denotes the probability space on which $X$ and $Y$ are defined). 
But we have 

$$\{\omega \in \Omega : |||\Delta^N||| < \tfrac{1}{2}\rho_\alpha\} \subset \{\omega \in \Omega : \forall a \in S,\ \|a\| = 1,\ Q_\alpha^N(a, a) > \tfrac{1}{2}\rho_\alpha > 0\}$$

and finally the right-hand side of the previous inclusion has a probability 
converging to 1 when $N$ converges to $+\infty$. 

Let $B(0, 1)$ be the weak closure of $\{a \in S : Q_\alpha^N(a, a) = 1\}$ and $\zeta$ 
be the functional defined on $\{a \in S : Q_\alpha^N(a, a) = 1\}$ by 
$\zeta(a) = \langle \Gamma_{E(X|Y)}^N a, a \rangle$; then $\zeta$ can be extended to a 
uniformly continuous functional $\tilde{\zeta}$ defined on $B(0, 1)$ for the weak 
topology. Finally, provided that $Q_\alpha^N(a, a) > (1/2)\rho_\alpha$, $\tilde{\zeta}$ 
reaches its maximum on the weakly compact set $B(0, 1)$, which concludes the 
proof of the existence of $(a_j^N)_{j=1,\ldots,q}$. 
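
The inclusion used in the existence argument follows from a short bound, sketched here under our reading of the notation (stated as assumptions): $\Delta^N = \Gamma_X^N - \Gamma_X$, $Q_\alpha^N$ is $Q_\alpha$ with $\Gamma_X^N$ in place of $\Gamma_X$, and $\rho_\alpha = \inf_{\|u\| = 1} Q_\alpha(u, u)$. For $a \in S$ with $\|a\| = 1$ and $|||\Delta^N||| < \rho_\alpha / 2$:

```latex
\begin{align*}
Q_\alpha^N(a, a)
  &= Q_\alpha(a, a) + \langle \Delta^N a, a \rangle \\
  &\geq \rho_\alpha - |||\Delta^N||| \\
  &> \rho_\alpha - \tfrac{1}{2}\rho_\alpha = \tfrac{1}{2}\rho_\alpha > 0 .
\end{align*}
```

The second line uses $|\langle \Delta^N a, a \rangle| \leq |||\Delta^N|||\,\|a\|^2$ together with the definition of $\rho_\alpha$.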

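Consistency: In the following, we consider an $\omega \in \Omega$ such that 
$\omega \in \{\omega \in \Omega : \gamma_\alpha^N \text{ has a maximum on } S \text{ and reaches it}\}$. 
Let $\lambda_1^N = \lambda_1^N(\omega)$ be this maximum and $\lambda_1^\alpha$ be the 
maximum of 
$\gamma_\alpha(a) = \langle \Gamma_{E(X|Y)} a, a \rangle / (\langle \Gamma_X a, a \rangle + \alpha[a, a])$ 
on $S$; $\lambda_1^\alpha$ is well defined thanks to assumption (A3). 

Considering $\gamma_\alpha(a)/\gamma(a)$, we easily show that 

$$\lambda_1^\alpha \to \lambda_1. \qquad (4)$$

Then, by proving that 
$\sup_{a \in S} |\gamma_\alpha^N(a) - \gamma_\alpha(a)| \xrightarrow{p} 0$, we can show that 

$$|\lambda_1^N - \lambda_1^\alpha| \xrightarrow{p} 0. \qquad (5)$$

Finally, by combining (4) and (5), we conclude that 

$$\lambda_1^N \xrightarrow{p} \lambda_1. \qquad (6)$$

Then, by using (6), we demonstrate that 

$$\gamma(a_1^N) \xrightarrow{p} \lambda_1 = \gamma(a_1). \qquad (7)$$

Thanks to the conclusion of Theorem 1, we show that 
$\lim_{N \to +\infty} P(\langle \Gamma_{E(X|Y)} a_1, a_1^N - a_1 \rangle = \langle \Gamma_X a_1, a_1^N - a_1 \rangle = 0) = 1$. 
Let $\mu_N$ be $\langle \Gamma_X(a_1^N - a_1), a_1^N - a_1 \rangle$; if 
$\langle \Gamma_{E(X|Y)} a_1, a_1^N - a_1 \rangle = 0$, we have 
$\lambda_1^{-1}\gamma(a_1^N) \leq (1 + \lambda_1^{-1}\lambda_2\mu_N)/(1 + \mu_N)$. As 
$\lambda_1^{-1}\lambda_2 < 1$, the right-hand side of the previous inequality is 
less than 1; but $\lambda_1^{-1}\gamma(a_1^N)$ converges in probability to 1 by (7), 
so $(1 + \lambda_1^{-1}\lambda_2\mu_N)/(1 + \mu_N) \xrightarrow{p} 1$ and then we 
conclude with $\mu_N \xrightarrow{p} 0$. 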
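
The final step of the consistency argument can be made explicit. As a short sketch, write $\mu_N = \langle \Gamma_X(a_1^N - a_1), a_1^N - a_1 \rangle \geq 0$ and $c = \lambda_1^{-1}\lambda_2 \in [0, 1)$, where $c$ is a shorthand introduced here for convenience:

```latex
\begin{align*}
1 - \frac{1 + c\,\mu_N}{1 + \mu_N}
  &= \frac{(1 - c)\,\mu_N}{1 + \mu_N}, \\
\frac{1 + c\,\mu_N}{1 + \mu_N} \xrightarrow{p} 1
  &\iff \frac{\mu_N}{1 + \mu_N} \xrightarrow{p} 0
  \iff \mu_N \xrightarrow{p} 0 .
\end{align*}
```

The first equivalence uses $1 - c > 0$; the second holds because $x \mapsto x/(1 + x)$ is an increasing bijection from $[0, +\infty)$ onto $[0, 1)$ with a continuous inverse.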



A.2 Theorem 3 



The proof of this theorem is close to the one found in Rossi & Conan-Guez 
(2005): the main difference is that the projection of the data is a random 
variable. The proof is divided into two parts. 
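We first prove that 

$$\sup_{w \in W} \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - E(\zeta(Z, w)) \right| \xrightarrow{p} 0. \qquad (8)$$

For all $w$ in $W$, we have 

$$\left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - E(\zeta(Z, w)) \right| \leq \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - \frac{1}{N}\sum_{n=1}^N \zeta(Z_n, w) \right| + \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n, w) - E(\zeta(Z, w)) \right|.$$

For proving that 

$$\sup_{w \in W} \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n, w) - E(\zeta(Z, w)) \right| \xrightarrow{a.s.} 0,$$

we need a general Uniform Strong Law of Large Numbers. Such a result is given 
in Rossi & Conan-Guez (2005) and, by assumptions (A7), (A8) and (A10), 
Corollary 3 of Rossi & Conan-Guez (2005) directly implies this convergence. 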

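Using assumption (A9), we see that 

$$\left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - \frac{1}{N}\sum_{n=1}^N \zeta(Z_n, w) \right| \leq C(w) \sum_{j=1}^q \langle \Gamma_X^N(a_j^N - a_j), a_j^N - a_j \rangle^{1/2}.$$

As $|||\Gamma_X^N - \Gamma_X||| \xrightarrow{p} 0$ and as, for all $j = 1, \ldots, q$, 
$\langle \Gamma_X(a_j^N - a_j), a_j^N - a_j \rangle \xrightarrow{p} 0$, we then conclude that 

$$\sup_{w \in W} \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - \frac{1}{N}\sum_{n=1}^N \zeta(Z_n, w) \right| \xrightarrow{p} 0$$

(see the same reference as above), which finally implies (8). 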

Secondly, let $\epsilon$ be a positive real. According to the Dominated 
Convergence Theorem, $E(\zeta(Z, \cdot))$ is a continuous function which reaches 
its minimum $m$ on the compact set $W$. Then we can show that there is an 
$\eta(\epsilon) > 0$ such that, for all $w$ in $W$, 

$$|E(\zeta(Z, w)) - m| < \eta \implies d(w, W^*) < \epsilon. \qquad (9)$$

Then let $\Omega_{\eta,N}$ be the following subset of $\Omega$: 

$$\Omega_{\eta,N} = \left\{ \omega \in \Omega : \sup_{w \in W} \left| \frac{1}{N}\sum_{n=1}^N \zeta(Z_n^N, w) - E(\zeta(Z, w)) \right| < \frac{\eta}{2} \right\}.$$

If $\omega \in \Omega_{\eta,N}$ then, as $W$ is a compact set, we can find, for all 
$N \in \mathbb{N}$, $w_N^*(\omega) \in W$ which minimizes 
$(1/N)\sum_{n=1}^N \zeta(Z_n^N(\omega), w)$. Let $w^*$ be in the closure of 
$(w_N^*)_N$; then, by arguments similar to the ones used in the first part of 
the proof, we show that, for all $\omega \in \Omega_{\eta,N}$ and for all $w \in W$, 
$E(\zeta(Z, w^*)) < E(\zeta(Z, w)) + \eta$, which implies by the use of (9) that 
$\Omega_{\eta,N} \subset \{\omega : d(w^*(\omega), W^*) < \epsilon\}$, and this concludes 
the proof as $\lim_{N \to +\infty} P(\Omega_{\eta,N}) = 1$. 
Figure 1: The regressor curves 

Figure 3: Tecator data set: SEP for 50 samples 



22 L. Ferre and N. Villa 



Scand J Statist 







[Figure 5 panels (SIR-NNr): Axis 1 x Axis 2; Axis 2 x Axis 3; Axis 3 x Axis 4. 
Legend: [sh], [iy], [dcl], [aa], [ao].] 

Figure 5: Projection onto the EDR space of 50 log-periodograms by class 

Figure 6: Phoneme Data: Test error rates for 50 samples 

Table 1: Best parameters for the five compared methods 

Method    Parameter 1                                      Parameter 2                       Parameter 3
PCA-NN    $k_n = 25$ (PCA dimension)                       $q_2 = 12$ (number of neurons)    -
NNf       $q_2 = 18$ (number of neurons)                   -                                 -
SIR-NNr   $\alpha = 5$ (regularization of $\Gamma_X$)      $q = 20$ (SIR dimension)          $q_2 = 10$ (number of neurons)
SIR-NNk   $h = 0.5$ (kernel window)                        $q = 10$ (SIR dimension)          $q_2 = 15$ (number of neurons)
SIR-L     $\alpha = 0.5$ (regularization of $\Gamma_X$)    $q = 20$ (SIR dimension)          -

Table 2: Best parameters for the five compared methods 

Method     Parameter 1                                                  Parameter 2                Parameter 3
SIR-NNr    $\alpha = 10$ (regularization of $\Gamma_X$)                 $q = 4$ (SIR dimension)    $q_2 = 15$ (number of neurons)
SIR-NNp    $k_n = 17$ (PCA dimension)                                   $q = 4$ (SIR dimension)    $q_2 = 12$ (number of neurons)
SIR-K      $\alpha = 10^{[illegible]}$ (regularization of $\Gamma_X$)   $q = 4$ (SIR dimension)    $h = 1$ (kernel bandwidth)
RPDA       $\alpha = 5$ (regularization of $\Gamma_X$)                  $q = 4$ (PDA dimension)    -
NPCD-PCA   $k_n = 7$ (PCA dimension)                                    $h = 25$ (kernel window)   -


